Analysis of PhD application

Abstract

In this Jupyter notebook I use the PhD applications dataset and standard, well-known machine learning methods to predict whether a student is accepted, waitlisted, or rejected (DECISION), as well as the committee's rating score of each student (RATING).

Table of Contents

  1. Import Libraries
  2. Introduction
    • Data Set
    • Variable Description
  3. Data Analysis
    • Data Visualization
    • Using groupby
  4. Data Preprocessing
    1. Handling Missing Values
    2. Target variable
    3. Converting categorical variables to numeric variables
  5. Overview of the Methods

    1. Gradient Descent
      a. Batch Gradient Descent
      b. Stochastic Gradient Descent
      c. Mini-Batch Gradient Descent
    2. Neural Networks
      a. Building Blocks: Neurons
      b. A Simple Example
      c. Combining Neurons into a Neural Network
      d. Feedforward
      e. Training a Neural Network
  6. Applying Machine Learning to predict DECISION using RATING

    1. Decision Tree
    2. Logistic Regression
    3. Random Forest
    4. Stochastic Gradient Descent
    5. KNN
    6. Gaussian Naive Bayes
    7. Perceptron
    8. SVM
    9. Linear SVM
    10. Adaptive Boosting
    11. XGBoost
    12. Which Model is the Best? (Table 1)
  7. Models with Grid Search (to predict DECISION)

    1. Decision Tree (DT)
    2. Logistic Regression (LG)
    3. Random Forest (RF)
    4. Stochastic Gradient Descent (SGD)
    5. KNN
    6. Gaussian Naive Bayes (GNB)
    7. Perceptron
    8. SVM
    9. Linear SVM
    10. Adaptive Boosting
    11. XGBoost
    12. Cat Boost
    13. Light GBM
    14. Which Model is the Best? (Table 4)
    15. Stacking Approach
    16. Using h2o AutoML
  8. Estimating the RATING Variable
    1. Individual Models
    2. h2o AutoML
    3. h2o GBM
    4. h2o RF
    5. Deep Learning Estimator
    6. Deep Water Estimator
  9. References

Introduction

The data used for this analysis was collected from a major University's Graduate Mathematics Application system for students applying to the Mathematics PhD program. The information is used by the Department of Mathematics to determine which applicants will be admitted into the graduate program. Each year, members of the department review each graduate application and give the prospective student a rating score between one and five, five being the best, with all values in between possible. This rating score determines whether an applicant is accepted, rejected, or put on a waitlist for the University's Mathematics graduate program.

The rating score (or just RATING) and whether an applicant is accepted, rejected, or put on a waitlist (DECISION) are the variables of interest for this project. The purpose of this research is to create both regression and classification models that can accurately predict RATING and DECISION, respectively, based on the data submitted by the student. The models we use include Random Forest, Gradient Boosting, Generalized Linear Models, Stacked Ensembles, XGBoost, and Deep Learning.

Data Set

The data is collected in a spreadsheet for easy visual inspection. Each row of data represents a single applicant identified by a unique identification number. Each application consists of the qualitative and quantitative data described in the table below; these variables make up the columns of the spreadsheet. Note that some of these fields are optional for the student to submit, so not every field has an entry for every student. This creates an issue of missing data, and later on we will discuss how this issue was dealt with.

Table 1.1.

| # | Variable | Description | Type |
|---|----------|-------------|------|
| 1 | Applicant Client ID | Application ID | Numeric |
| 2 | Emphasis Area | First choice of study area | Factor |
| 3 | Emphasis Area 2 | Secondary choice of study area | Factor |
| 4 | Emphasis Area 3 | Tertiary choice of study area | Factor |
| 5 | UU_APPL_CITIZEN | US Citizen (Yes or No) | Factor (Binary) |
| 6 | CTZNSHP | Citizenship of the applicant | Factor |
| 7 | AGE | Age of the applicant in years | Numeric |
| 8 | SEX | Gender of the applicant | Factor |
| 9 | LOW_INCOME | Whether the applicant comes from a low-income family | Factor (Binary) |
| 10 | UU_FIRSTGEN | Whether the applicant is the first generation attending grad school | Factor (Binary) |
| 11 | UU_APPL_NTV_LANG | Applicant's native language | Factor |
| 12 | HAS_LANGUAGE_TEST | Foreign language exam, if applicable (TOEFL IBT, IELTS, or blank) | Factor |
| 13 | TEST_READ | Score on the reading part of TOEFL | Numeric |
| 14 | TEST_SPEAK | Score on the speaking part of TOEFL | Numeric |
| 15 | TEST_WRITE | Score on the writing part of TOEFL | Numeric |
| 16 | TEST_LISTEN | Score on the listening part of TOEFL | Numeric |
| 17 | MAJOR | Applicant's undergraduate major | Factor |
| 18 | GPA | Applicant's GPA | Numeric |
| 19 | NUM_PREV_INSTS | Number of previous institutions the applicant studied at | Numeric |
| 20 | HAS_GRE_GEN | Whether the applicant has taken the GRE General exam | Factor (Binary) |
| 21 | GRE_VERB | Raw score on the verbal part of the GRE | Numeric |
| 22 | GRE_QUANT | Raw score on the quantitative part of the GRE | Numeric |
| 23 | GRE_AW | Raw score on the analytical writing part of the GRE | Numeric |
| 24 | HAS_GRE_SUBJECT | Whether the applicant has taken the GRE Subject exam | Factor (Binary) |
| 25 | GRE_SUB | Raw score on the Math Subject GRE | Numeric |
| 26 | NUM_RECOMMENDS | Number of the applicant's recommenders | Numeric |
| 27 | R_AVG_ORAL | Recommenders' average rating of the applicant's oral excellence | Numeric |
| 28 | R_AVG_WRITTEN | Recommenders' average rating of the applicant's written excellence | Numeric |
| 29 | R_AVG_ACADEMIC | Recommenders' average rating of the applicant's academic excellence | Numeric |
| 30 | R_AVG_KNOWLEDGE | Recommenders' average rating of the applicant's knowledge of the field | Numeric |
| 31 | R_AVG_EMOT | Recommenders' average rating of the applicant's emotional excellence | Numeric |
| 32 | R_AVG_MOT | Recommenders' average rating of the applicant's motivational excellence | Numeric |
| 33 | R_AVG_RES | Recommenders' average rating of the applicant's research skill | Numeric |
| 34 | R_AVG_RATING | Recommenders' average overall rating of the applicant | Numeric |
| 35 | RATING | Rating score of the committee | Numeric |
| 36 | DECISION | Faculty application decision (Accept, Reject, or Waitlist) | Factor |

The data set includes 759 graduate applications that were submitted for admission in Fall 2016, Fall 2017, Fall 2018, and Fall 2019. There are various missing data points throughout the dataset. Figure 1.1 below shows the number of missing values for each variable across the whole data set, with missing data represented by shorter columns. The variable names are listed along the bottom, the total number of data entries along the top, the percentage of missing data on the left, and the number of non-missing values for each variable on the right. For example, the columns for TEST_READ, TEST_SPEAK, TEST_WRITE, and TEST_LISTEN are noticeably shorter.

Figure 1.1.

The applicant's age (AGE) was calculated using the applicant's birthday and is accurate as of 1 January of the year in which they applied. Also, since not all universities use the same GPA scale, GPA values over four were reviewed and rescaled based on information deduced from the applicant's resume.

Data Analysis

To do some analysis, we first need to load the data into the Jupyter notebook. We can then look at the head of the data; the output is hidden for confidentiality reasons.

After loading the data into the Jupyter notebook, we can see the names of the variables, the number of non-missing observations each variable has, and the type of each variable.
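This inspection takes only a couple of pandas calls. Because the real spreadsheet is confidential, the sketch below builds a tiny synthetic stand-in with a few of the columns from Table 1.1; in the notebook the data would instead be loaded with `pd.read_csv` or `pd.read_excel`.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the confidential application data.
students = pd.DataFrame({
    "AGE": [23.0, 27.0, 24.0],
    "GPA": [3.8, 3.2, 4.0],
    "TEST_READ": [28.0, np.nan, np.nan],  # optional TOEFL field, often missing
    "DECISION": ["Admit", "Reject", "Waitlist"],
})

# Column names, non-null counts, and dtypes (the same call produced
# the summary shown below for the full data set).
students.info()

# Number of missing values per variable.
missing = students.isnull().sum()
print(missing)
```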

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 759 entries, 0 to 758
Data columns (total 36 columns):
Applicant_Client_ID    759 non-null int64
Emphasis Area          759 non-null object
Emphasis Area 2        759 non-null object
Emphasis Area 3        759 non-null object
UU_APPL_CITIZEN        759 non-null object
CTZNSHP                759 non-null object
AGE                    759 non-null float64
SEX                    759 non-null object
LOW_INCOME             759 non-null object
UU_FIRSTGEN            759 non-null object
UU_APPL_NTV_LANG       759 non-null object
HAS_LANGUAGE_TEST      759 non-null object
TEST_READ              272 non-null float64
TEST_SPEAK             272 non-null float64
TEST_WRITE             272 non-null float64
TEST_LISTEN            272 non-null float64
MAJOR                  759 non-null object
GPA                    759 non-null float64
NUM_PREV_INSTS         759 non-null int64
HAS_GRE_GEN            759 non-null object
GRE_VERB               647 non-null float64
GRE_QUANT              647 non-null float64
GRE_AW                 647 non-null float64
HAS_GRE_SUBJECT        759 non-null object
GRE_SUB                554 non-null float64
NUM_RECOMMENDS         759 non-null int64
R_AVG_ORAL             759 non-null float64
R_AVG_WRITTEN          759 non-null float64
R_AVG_ACADEMIC         759 non-null float64
R_AVG_KNOWLEDGE        759 non-null float64
R_AVG_EMOT             759 non-null float64
R_AVG_MOT              759 non-null float64
R_AVG_RES              759 non-null float64
R_AVG_RATING           759 non-null float64
RATING                 759 non-null float64
DECISION               759 non-null object
dtypes: float64(19), int64(3), object(14)
memory usage: 213.6+ KB

As mentioned in Table 1.1, there are 36 columns and 759 observations. Let us look at the number of missing values for each variable.

Applicant_Client_ID      0
Emphasis Area            0
Emphasis Area 2          0
Emphasis Area 3          0
UU_APPL_CITIZEN          0
CTZNSHP                  0
AGE                      0
SEX                      0
LOW_INCOME               0
UU_FIRSTGEN              0
UU_APPL_NTV_LANG         0
HAS_LANGUAGE_TEST        0
TEST_READ              487
TEST_SPEAK             487
TEST_WRITE             487
TEST_LISTEN            487
MAJOR                    0
GPA                      0
NUM_PREV_INSTS           0
HAS_GRE_GEN              0
GRE_VERB               112
GRE_QUANT              112
GRE_AW                 112
HAS_GRE_SUBJECT          0
GRE_SUB                205
NUM_RECOMMENDS           0
R_AVG_ORAL               0
R_AVG_WRITTEN            0
R_AVG_ACADEMIC           0
R_AVG_KNOWLEDGE          0
R_AVG_EMOT               0
R_AVG_MOT                0
R_AVG_RES                0
R_AVG_RATING             0
RATING                   0
DECISION                 0
dtype: int64

Data Visualization

We would like to see the relations between variables via visualization. Let us start by counting the number of students who were admitted, rejected, and waitlisted.
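Counting the outcomes is a one-liner with `value_counts`. The sketch below uses a toy series in place of the confidential `students["DECISION"]` column.

```python
import pandas as pd

# Toy stand-in for students["DECISION"].
decision = pd.Series(["Reject", "Waitlist", "Admit", "Reject", "Reject"])

counts = decision.value_counts()
print(counts)  # Reject appears 3 times in this toy series

# The class balance could then be plotted with:
# counts.plot(kind="bar")
```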

Reject      403
Waitlist    242
Admit       114
Name: DECISION, dtype: int64

The histogram of all the students looks like:

We would like to show the relations between their GPA, the recommenders' average rating, and the committee rating.

The next figure shows the relationship between decision variable according to the gender (sex variable).

In the following figure, we will see the relationship between decision, major and sex.

This histogram does not tell us much, except that applicants of unspecified sex were accepted, rejected, and waitlisted in roughly equal numbers.

Let us see the scatter plot of the decision variable according to the AGE and GPA variables. This scatter plot shows that there are 9-10 students over the age of 40, and 2 of them were admitted.

We wonder if the recommenders' average overall rating of an applicant has any relation to the decision variable. We see no indication that a higher overall rating implies a higher chance of being admitted.

The next plot shows the relationship between low income and decision. The distribution of each category (Admit, Reject, or Waitlist) looks very similar regardless of the income of the applicant's family.

Similarly, let us see if there is a strong relation between being the first generation to attend graduate school and decision.

Let us also see if there is a strong relation between the number of previous institutions and decision.

These histograms show that low income, being the first generation in your family to attend grad school, and the number of previous institutions you studied at are relevant to being accepted.

In the next plot, we will see the relation between GPA and decision. From the plot we see that waitlisted applicants have a higher average GPA than admitted applicants.

In the following plot, we see that admitted students tend to have a higher number of previous institutions.

Using groupby to see the relations

First we drop the applicant's client ID. The next table shows the mean of the numerical variables grouped by gender.
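The gender-wise means can be computed with `groupby`. A minimal sketch on a synthetic stand-in (the values below are made up):

```python
import pandas as pd

# Toy frame standing in for the confidential data.
students = pd.DataFrame({
    "SEX": ["Female", "Male", "Male", "Female", "Unspecified"],
    "GPA": [3.6, 3.2, 3.4, 3.8, 3.5],
    "RATING": [4.0, 3.5, 3.0, 4.5, 4.0],
})

# Average every numeric variable within each gender group.
by_sex = students.groupby("SEX").mean(numeric_only=True)
print(by_sex)
```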

AGE TEST_READ TEST_SPEAK TEST_WRITE TEST_LISTEN GPA NUM_PREV_INSTS GRE_VERB GRE_QUANT GRE_AW ... NUM_RECOMMENDS R_AVG_ORAL R_AVG_WRITTEN R_AVG_ACADEMIC R_AVG_KNOWLEDGE R_AVG_EMOT R_AVG_MOT R_AVG_RES R_AVG_RATING RATING
SEX
Female 23.865241 26.148936 22.308511 23.744681 25.680851 3.493529 1.877005 155.368750 163.937500 3.868750 ... 3.197861 14.771658 14.660428 16.567914 16.156150 16.290374 17.993048 14.368984 22.209091 3.754545
Male 24.333816 27.418981 21.875000 24.476852 26.037037 3.328879 1.853526 156.038217 166.065817 3.757962 ... 3.160940 14.212477 14.263291 16.847920 16.963834 15.775769 17.888969 14.633635 22.992043 3.649747
Unspecified 24.073684 27.333333 20.166667 24.222222 22.666667 3.595789 1.789474 153.187500 166.437500 3.468750 ... 3.421053 12.584211 12.963158 16.300000 15.752632 15.005263 16.542105 13.442105 20.826316 4.089474

3 rows × 21 columns

Let us group the students according to their decision and their gender.

SEX Female Male Unspecified
DECISION
Admit 33 76 5
Reject 92 305 6
Waitlist 62 172 8

Let us group the students according to their decision, first-generation status, and gender.

SEX Female Male Unspecified
DECISION UU_FIRSTGEN
Admit N 12.0 11.0 NaN
Unspecified 20.0 62.0 5.0
Y 1.0 3.0 NaN
Reject N 22.0 61.0 NaN
Unspecified 62.0 223.0 5.0
Y 8.0 21.0 1.0
Waitlist N 11.0 36.0 1.0
Unspecified 46.0 123.0 7.0
Y 5.0 13.0 NaN

Data Preprocessing

Handling Missing Values

The following table shows how many missing values each variable has.

Emphasis Area          0
Emphasis Area 2        0
Emphasis Area 3        0
UU_APPL_CITIZEN        0
CTZNSHP                0
AGE                    0
SEX                    0
LOW_INCOME             0
UU_FIRSTGEN            0
UU_APPL_NTV_LANG       0
HAS_LANGUAGE_TEST      0
TEST_READ            487
TEST_SPEAK           487
TEST_WRITE           487
TEST_LISTEN          487
MAJOR                  0
GPA                    0
NUM_PREV_INSTS         0
HAS_GRE_GEN            0
GRE_VERB             112
GRE_QUANT            112
GRE_AW               112
HAS_GRE_SUBJECT        0
GRE_SUB              205
NUM_RECOMMENDS         0
R_AVG_ORAL             0
R_AVG_WRITTEN          0
R_AVG_ACADEMIC         0
R_AVG_KNOWLEDGE        0
R_AVG_EMOT             0
R_AVG_MOT              0
R_AVG_RES              0
R_AVG_RATING           0
RATING                 0
DECISION               0
dtype: int64

Looking at the table, we can either drop the 8 variables that have missing values, or fill them in with the mean, median, or most common values. We will go with the latter; in other words, we will apply a simple imputation method.

Before imputation, let us first take a look at GPA. We know a GPA should not exceed 4; let us see if any student reported a GPA higher than 4.

733    4.08
105    4.05
34     4.00
36     4.00
69     4.00
114    4.00
Name: GPA, dtype: float64

This shows that two students entered a GPA higher than 4. We will set these to 4.00 to be consistent.
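Capping at 4.00 is a one-liner with pandas' `clip`; a small sketch with toy GPA values:

```python
import pandas as pd

gpa = pd.Series([4.08, 4.05, 4.00, 3.70])  # toy values; two exceed 4

# Cap values above 4.00 at 4.00; smaller values are left unchanged.
gpa_capped = gpa.clip(upper=4.0)
print(gpa_capped.max())  # 4.0
```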

We would like to impute all the missing values. Many of the applicants are from English-speaking countries such as the US, UK, and Canada, which is why their TOEFL scores are missing. According to https://www.prepscholar.com/toefl/blog/what-is-the-average-toefl-score/, the United States' average TOEFL scores are 21 for Reading, 23 for Speaking, 22 for Writing, and 23 for Listening, for a total of 89; the averages for the UK and Canada are very similar. Before imputing the missing values with the country averages from this website, we first look at the other students' average TOEFL scores for each section, grouped by gender.

AGE TEST_READ TEST_SPEAK TEST_WRITE TEST_LISTEN GPA NUM_PREV_INSTS GRE_VERB GRE_QUANT GRE_AW ... NUM_RECOMMENDS R_AVG_ORAL R_AVG_WRITTEN R_AVG_ACADEMIC R_AVG_KNOWLEDGE R_AVG_EMOT R_AVG_MOT R_AVG_RES R_AVG_RATING RATING
SEX
Female 23.865241 26.148936 22.308511 23.744681 25.680851 3.493102 1.877005 155.368750 163.937500 3.868750 ... 3.197861 14.771658 14.660428 16.567914 16.156150 16.290374 17.993048 14.368984 22.209091 3.754545
Male 24.333816 27.418981 21.875000 24.476852 26.037037 3.328788 1.853526 156.038217 166.065817 3.757962 ... 3.160940 14.212477 14.263291 16.847920 16.963834 15.775769 17.888969 14.633635 22.992043 3.649747
Unspecified 24.073684 27.333333 20.166667 24.222222 22.666667 3.595789 1.789474 153.187500 166.437500 3.468750 ... 3.421053 12.584211 12.963158 16.300000 15.752632 15.005263 16.542105 13.442105 20.826316 4.089474

3 rows × 21 columns

Imputing the missing Values

We will replace the missing values of each variable with the mean of the other observations of that variable within the same gender group.
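One way to implement this gender-wise mean imputation is `groupby(...).transform` combined with `fillna`. The sketch below shows the idea on a toy frame with one column; the same pattern would be applied to every column with missing values.

```python
import numpy as np
import pandas as pd

students = pd.DataFrame({
    "SEX": ["Female", "Female", "Male", "Male"],
    "TEST_READ": [26.0, np.nan, 28.0, np.nan],
})

# Replace each missing value with the column mean computed within
# the applicant's SEX group.
students["TEST_READ"] = (
    students.groupby("SEX")["TEST_READ"]
            .transform(lambda s: s.fillna(s.mean()))
)
print(students)
```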

To make sure our code worked, we check whether any missing values remain.

Emphasis Area        0
Emphasis Area 2      0
Emphasis Area 3      0
UU_APPL_CITIZEN      0
CTZNSHP              0
AGE                  0
SEX                  0
LOW_INCOME           0
UU_FIRSTGEN          0
UU_APPL_NTV_LANG     0
HAS_LANGUAGE_TEST    0
TEST_READ            0
TEST_SPEAK           0
TEST_WRITE           0
TEST_LISTEN          0
MAJOR                0
GPA                  0
NUM_PREV_INSTS       0
HAS_GRE_GEN          0
GRE_VERB             0
GRE_QUANT            0
GRE_AW               0
HAS_GRE_SUBJECT      0
GRE_SUB              0
NUM_RECOMMENDS       0
R_AVG_ORAL           0
R_AVG_WRITTEN        0
R_AVG_ACADEMIC       0
R_AVG_KNOWLEDGE      0
R_AVG_EMOT           0
R_AVG_MOT            0
R_AVG_RES            0
R_AVG_RATING         0
RATING               0
DECISION             0
dtype: int64

After imputing all the variables, it is time to look at the histograms of each variable.

Target Variable

The fitted normal distribution of RATING has $\mu = 3.69$ and $\sigma = 0.78$.

This is slightly left-skewed, but we will keep it this way.

Now the data is almost ready. We would like to convert the categorical variables to numeric variables.

Converting categorical variables to numeric variables


Many machine learning algorithms can handle categorical values without further manipulation, but many others cannot. For example, models such as regression or SVM are algebraic, which means their input must be numerical. To use these models, the categories must first be transformed into numbers before the learning algorithm is applied. The analyst is therefore faced with the challenge of turning these text attributes into numerical values for further processing.

We will use the one-hot encoding technique to convert all the categorical variables into numeric variables.
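A minimal sketch of one-hot encoding with `pd.get_dummies` (toy columns; the notebook applies it to the full set of categorical variables):

```python
import pandas as pd

students = pd.DataFrame({
    "SEX": ["Female", "Male", "Unspecified"],
    "GPA": [3.8, 3.2, 3.5],
})

# Every object (categorical) column is expanded into indicator columns;
# numeric columns pass through unchanged.
encoded = pd.get_dummies(students)
print(encoded.columns.tolist())
# GPA plus SEX_Female, SEX_Male, SEX_Unspecified
```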

Overview of the Methods

Gradient Descent

Gradient Descent is a very generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function.

The MSE cost function for a Linear Regression model is $$ MSE(X,h_\theta)=\frac{1}{m}\sum_{i=1}^m \left(\theta^T \cdot x^{(i)}-y^{(i)}\right)^2$$ where $\theta$ is the model's parameter vector, containing the bias term $\theta_0$ and the feature weights $\theta_1$ to $\theta_n$.

  • $\theta^T$ is the transpose of $\theta$ (a row vector instead of a column vector).
  • $x$ is the instance’s feature vector, containing $x_0$ to $x_n$, with $x_0$ always equal to 1.
  • $\theta^T \cdot x$ is the dot product of $\theta^T$ and $x$.
  • $h_\theta$ is the hypothesis function, using the model parameters $\theta$.

Gradient Descent measures the local gradient of the error function with regards to the parameter vector $\theta $, and it goes in the direction of descending gradient. Once the gradient is zero, you have reached a minimum.

Concretely, you start by filling $\theta$ with random values (this is called random initialization), and then you improve it gradually, taking one baby step at a time, each step attempting to decrease the cost function (e.g., the MSE), until the algorithm converges to a minimum.

An important parameter in Gradient Descent is the size of the steps, determined by the learning rate hyperparameter. If the learning rate is too small, then the algorithm will have to go through many iterations to converge, which will take a long time.

On the other hand, if the learning rate is too high, you might jump across the valley and end up on the other side, possibly even higher up than you were before. This might make the algorithm diverge, with larger and larger values, failing to find a good solution.

Finally, not all cost functions look like nice regular bowls. There may be holes, ridges, plateaus, and all sorts of irregular terrains, making convergence to the minimum very difficult. Next figure shows the two main challenges with Gradient Descent: if the random initialization starts the algorithm on the left, then it will converge to a local minimum, which is not as good as the global minimum. If it starts on the right, then it will take a very long time to cross the plateau, and if you stop too early you will never reach the global minimum.

Fortunately, the MSE cost function for a Linear Regression model happens to be a convex function, which means that if you pick any two points on the curve, the line segment joining them never crosses the curve. This implies that there are no local minima, just one global minimum. It is also a continuous function whose derivative is Lipschitz continuous. These two facts have a great consequence: Gradient Descent is guaranteed to approach arbitrarily close to the global minimum (if you wait long enough and if the learning rate is not too high).

Batch Gradient Descent

To implement Gradient Descent, you need to compute the gradient of the cost function with regards to each model parameter $\theta_j$. In other words, you need to calculate partial derivatives. $$\frac{\partial }{\partial \theta_j}MSE(\theta) = \frac{2}{m}\sum_{i=1}^m \left(\theta^T \cdot x^{(i)}-y^{(i)}\right)x^{(i)}_j$$

Instead of computing these gradients individually, you can use $$ \nabla_\theta MSE(\theta)= \frac{2}{m}X^T\cdot(X\cdot \theta-y) $$ to compute them all in one go. The gradient vector, noted $\nabla_\theta MSE(\theta)$, contains all the partial derivatives of the cost function (one for each model parameter).

Notice that this formula involves calculations over the full training set X, at each Gradient Descent step! This is why the algorithm is called Batch Gradient Descent: it uses the whole batch of training data at every step. As a result it is terribly slow on very large training sets (but we will see much faster Gradient Descent algorithms shortly). However, Gradient Descent scales well with the number of features; training a Linear Regression model when there are hundreds of thousands of features is much faster using Gradient Descent than using the Normal Equation.

Once you have the gradient vector, which points uphill, just go in the opposite direction to go downhill. This means subtracting $\nabla_\theta MSE(\theta)$ from $\theta$. This is where the learning rate $\eta$ comes into play: multiply the gradient vector by $\eta$ to determine the size of the downhill step. $$\theta^{next \; step }=\theta-\eta \nabla_\theta MSE(\theta) $$
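The update rule above can be sketched in a few lines of NumPy. The synthetic data, learning rate, and iteration count below are illustrative assumptions, not values from this project.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic linear data: y = 4 + 3x + Gaussian noise.
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.normal(size=(m, 1))
X_b = np.c_[np.ones((m, 1)), X]  # add x0 = 1 to each instance

eta = 0.1                        # learning rate
n_iterations = 1000
theta = rng.normal(size=(2, 1))  # random initialization

for _ in range(n_iterations):
    # Full-batch gradient: (2/m) X^T (X theta - y)
    gradients = (2 / m) * X_b.T @ (X_b @ theta - y)
    theta = theta - eta * gradients  # step downhill

print(theta.ravel())  # converges close to [4, 3]
```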

But what if you had used a different learning rate $\eta$? The next figure shows the first 10 steps of Gradient Descent using three different learning rates (the dashed line represents the starting point).

On the left, the learning rate is too low: the algorithm will eventually reach the solution, but it will take a long time. In the middle, the learning rate looks pretty good: in just a few iterations, it has already converged to the solution. On the right, the learning rate is too high: the algorithm diverges, jumping all over the place and actually getting further and further away from the solution at every step. To find a good learning rate, you can use grid search. However, you may want to limit the number of iterations so that grid search can eliminate models that take too long to converge.

Stochastic Gradient Descent

The main problem with Batch Gradient Descent is the fact that it uses the whole training set to compute the gradients at every step, which makes it very slow when the training set is large. At the opposite extreme, Stochastic Gradient Descent just picks a random instance in the training set at every step and computes the gradients based only on that single instance. Obviously this makes the algorithm much faster since it has very little data to manipulate at every iteration. It also makes it possible to train on huge training sets, since only one instance needs to be in memory at each iteration (SGD can be implemented as an out-of-core algorithm.)

On the other hand, due to its stochastic (i.e., random) nature, this algorithm is much less regular than Batch Gradient Descent: instead of gently decreasing until it reaches the minimum, the cost function will bounce up and down, decreasing only on average. Over time it will end up very close to the minimum, but once it gets there it will continue to bounce around, never settling down. So once the algorithm stops, the final parameter values are good, but not optimal.

When the cost function is very irregular, this can actually help the algorithm jump out of local minima, so Stochastic Gradient Descent has a better chance of finding the global minimum than Batch Gradient Descent does.

Therefore randomness is good to escape from local optima, but bad because it means that the algorithm can never settle at the minimum. One solution to this dilemma is to gradually reduce the learning rate. The steps start out large (which helps make quick progress and escape local minima), then get smaller and smaller, allowing the algorithm to settle at the global minimum. This process is called simulated annealing, because it resembles the process of annealing in metallurgy where molten metal is slowly cooled down. The function that determines the learning rate at each iteration is called the learning schedule. If the learning rate is reduced too quickly, you may get stuck in a local minimum, or even end up frozen halfway to the minimum. If the learning rate is reduced too slowly, you may jump around the minimum for a long time and end up with a suboptimal solution if you halt training too early.

By convention we iterate by rounds of m iterations; each round is called an epoch.
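A sketch of Stochastic Gradient Descent with a simple learning schedule, on the same kind of synthetic linear data (all hyperparameters here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear data: y = 4 + 3x + Gaussian noise.
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.normal(size=(m, 1))
X_b = np.c_[np.ones((m, 1)), X]

n_epochs = 50
t0, t1 = 5, 50  # learning-schedule hyperparameters (assumed)

def learning_schedule(t):
    # Gradually shrink the learning rate (simulated annealing).
    return t0 / (t + t1)

theta = rng.normal(size=(2, 1))
for epoch in range(n_epochs):
    for i in range(m):            # m iterations = one epoch
        idx = rng.integers(m)     # pick one random instance
        xi, yi = X_b[idx:idx + 1], y[idx:idx + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients

print(theta.ravel())  # bounces around, but ends up near [4, 3]
```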

Mini-batch Gradient Descent

The last Gradient Descent algorithm we will look at is called Mini-batch Gradient Descent. It is quite simple to understand once you know Batch and Stochastic Gradient Descent: at each step, instead of computing the gradients based on the full training set (as in Batch GD) or based on just one instance (as in Stochastic GD), Mini-batch GD computes the gradients on small random sets of instances called mini-batches.

The main advantage of Mini-batch GD over Stochastic GD is that you can get a performance boost from hardware optimization of matrix operations, especially when using GPUs.

The algorithm’s progress in parameter space is less erratic than with SGD, especially with fairly large mini-batches. As a result, Mini-batch GD will end up walking around a bit closer to the minimum than SGD. But, on the other hand, it may be harder for it to escape from local minima (in the case of problems that suffer from local minima, unlike Linear Regression as we saw earlier). The next figure shows the paths taken by the three Gradient Descent algorithms in parameter space during training. They all end up near the minimum, but Batch GD’s path actually stops at the minimum, while both Stochastic GD and Mini-batch GD continue to walk around. However, don’t forget that Batch GD takes a lot of time to take each step, and Stochastic GD and Mini-batch GD would also reach the minimum if you used a good learning schedule.

Neural Networks

Building Blocks: Neurons

First, we have to talk about neurons, the basic unit of a neural network. A neuron takes inputs, does some math with them, and produces one output. Here’s what a 2-input neuron looks like:

3 things are happening here. First, in a red square, each input is multiplied by a weight:

\begin{align} x_1 & \to x_1* w_1\\ x_2 & \to x_2* w_2\\ \end{align}

Next, in a blue square, all the weighted inputs are added together with a bias b:

$$(x_1*w_1)+(x_2*w_2)+b$$

Finally, in the orange square, the sum is passed through an activation function

$$y=f(x_1*w_1+x_2*w_2+b)$$

The activation function is used to turn an unbounded input into an output that has a nice, predictable form. A commonly used activation function is the sigmoid function: \begin{equation} {\displaystyle S(x)={\frac {1}{1+e^{-x}}}={\frac {e^{x}}{e^{x}+1}}.} \end{equation}

The sigmoid function only outputs numbers in the range $(0,1)$. You can think of it as compressing $(-\infty, +\infty)$ to $(0,1)$ - big negative numbers become $\sim 0$, and big positive numbers become $\sim 1$.

A sigmoid function is a bounded, differentiable, real function that is defined for all real input values and has a non-negative derivative at each point. A sigmoid "function" and a sigmoid "curve" refer to the same object.

A Simple Example

Assume we have a 2-input neuron that uses the sigmoid activation function and has the following parameters:

\begin{align} w &=(0,1) \\ b & = 4\\ \end{align}

where $w_1=0$ and $w_2=1$. Now, let’s give the neuron an input of $x=(2,3)$. We’ll use the dot product to write things more concisely: \begin{align} (w*x)+b= & ((w_1*x_1)+(w_2*x_2))+b \\ =& 0*2+1*3+4\\ =& 7\\ y=f(w*x+b)=&f(7)=1 / (1 + e^{-7})= 0.999 \end{align}

The neuron outputs 0.999 given the inputs $x=(2,3)$. That’s it! This process of passing inputs forward to get an output is known as feedforward.
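The worked example can be verified in a few lines of Python (the class and function names are our own):

```python
import math

def sigmoid(x):
    # S(x) = 1 / (1 + e^(-x))
    return 1 / (1 + math.exp(-x))

class Neuron:
    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def feedforward(self, inputs):
        # Weight the inputs, add the bias, then apply the activation.
        total = sum(w * x for w, x in zip(self.weights, inputs)) + self.bias
        return sigmoid(total)

n = Neuron(weights=(0, 1), bias=4)
print(round(n.feedforward((2, 3)), 3))  # 0.999, as computed above
```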

Combining Neurons into a Neural Network

A neural network is nothing more than a bunch of neurons connected together. Here’s what a simple neural network might look like:

This network has 2 inputs, a hidden layer with 2 neurons ($h_1$ and $h_2$), and an output layer with 1 neuron ($o_1$). Notice that the inputs for $o_1$ are the outputs from $h_1$ and $h_2$ - that's what makes this a network.

A hidden layer is any layer between the input (first) layer and output (last) layer. There can be multiple hidden layers!

An Example: Feedforward

Let’s use the network pictured above and assume all neurons have the same weights $w=(0,1)$, the same bias $b = 0$, and the same sigmoid activation function. Let $h_1, h_2, o_1$ denote the outputs of the neurons they represent.

What happens if we pass in the input $x = (2, 3)$?

\begin{align} h_1=h_2&=f(w* x+b) \\ &=f((0* 2)+(1* 3)+0)\\ &=f(3)\\ &=1 / (1 + e^{-3})\\ &=0.9526 \\ o_1&=f(w* (h_1,h_2)+b)\\ &=f((0* h_1)+(1* h_2)+0)\\ &=f(0.9526)\\ &=1 / (1 + e^{-0.9526})\\ &=0.7216 \end{align}

The output of the neural network for input $x = (2, 3)$ is 0.7216. Pretty simple, right?

A neural network can have any number of layers with any number of neurons in those layers. The basic idea stays the same: feed the input(s) forward through the neurons in the network to get the output(s) at the end. For simplicity, we’ll keep using the network pictured above for the rest of this topic.
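The feedforward pass above, as a small Python sketch (the names are our own; every neuron shares $w = (0, 1)$ and $b = 0$ as in the example):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

class Neuron:
    def __init__(self, weights, bias):
        self.weights, self.bias = weights, bias

    def feedforward(self, inputs):
        total = sum(w * x for w, x in zip(self.weights, inputs)) + self.bias
        return sigmoid(total)

class OurNeuralNetwork:
    """The 2-2-1 network pictured above."""
    def __init__(self):
        w, b = (0, 1), 0
        self.h1, self.h2, self.o1 = Neuron(w, b), Neuron(w, b), Neuron(w, b)

    def feedforward(self, x):
        out_h1 = self.h1.feedforward(x)
        out_h2 = self.h2.feedforward(x)
        # o1's inputs are the outputs of the hidden layer.
        return self.o1.feedforward((out_h1, out_h2))

net = OurNeuralNetwork()
print(round(net.feedforward((2, 3)), 4))  # 0.7216
```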

Training a Neural Network

Say we have the following measurements:

| Name | Weight (lb) | Height (in) | Gender |
|------|-------------|-------------|--------|
| Alice | 132 | 65 | F |
| Bob | 160 | 72 | M |
| Charlie | 152 | 75 | M |
| Diana | 120 | 60 | F |

Let’s train our network to predict someone’s gender given their weight and height:

We’ll represent Male with a 0 and Female with a 1, and we will also shift the data to make it easier to use:

| Name | Weight (minus 141) | Height (minus 68) | Gender |
| --- | --- | --- | --- |
| Alice | -9 | -3 | 1 |
| Bob | 19 | 4 | 0 |
| Charlie | 11 | 7 | 0 |
| Diana | -21 | -8 | 1 |

Here, note that $(132+160+152+120)/4=141$ and $(65+72+75+60)/4=68$

Loss

Before we train our network, we first need a way to quantify how "good" it's doing so that it can try to do "better". That's what the loss is.

We'll use the mean squared error (MSE) loss:

$$ MSE = \frac{1}{n}\sum_{i=1}^{n}(y_{true}-y_{pred})^2$$

Let's break this down:

  • n is the number of samples, which is 4.
  • y represents the variable being predicted, which is Gender.
  • $y_{true}$ is the true value of the variable. For example, $y_{true}$ for Alice would be 1 (Female).
  • $y_{pred}$ is the predicted value of the variable. It’s whatever our network outputs.

$(y_{true}-y_{pred})^2$ is known as the squared error. Our loss function is simply taking the average over all squared errors (hence the name mean squared error). The better our predictions are, the lower our loss will be!

Training a network = trying to minimize its loss.

An Example Loss Calculation

Let’s say our network always outputs 0 - in other words, it's confident all humans are Male 🤔. What would our loss be?

Let diff = $(y_{true}-y_{pred})^2$

| Name | $y_{true}$ | $y_{pred}$ | diff |
| --- | --- | --- | --- |
| Alice | 1 | 0 | 1 |
| Bob | 0 | 0 | 0 |
| Charlie | 0 | 0 | 0 |
| Diana | 1 | 0 | 1 |
$$ MSE = \frac{1}{4}(1+0+0+1)=0.5 $$
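This loss calculation is easy to verify in code (a quick sketch, independent of the notebook's dataset):

```python
def mse_loss(y_true, y_pred):
    # Mean squared error: the average of the squared differences.
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

# A network that always outputs 0 on Alice, Bob, Charlie, Diana
# (true genders 1, 0, 0, 1):
print(mse_loss([1, 0, 0, 1], [0, 0, 0, 0]))  # 0.5
```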

We now have a clear goal: minimize the loss of the neural network. We know we can change the network's weights and biases to influence its predictions, but how do we do so in a way that decreases loss?

For simplicity, let's pretend we only have Alice in our dataset:

| Name | Weight (minus 141) | Height (minus 68) | Gender |
| --- | --- | --- | --- |
| Alice | -9 | -3 | 1 |

Then the mean squared error loss is just Alice’s squared error:

\begin{align} MSE&=\frac{1}{1}\sum_{i=1}^1(y_{true}−y_{pred})^2\\ &=(y_{true}−y_{pred})^2\\ & =(1−y_{pred})^2 \end{align}

Another way to think about loss is as a function of weights and biases. Let’s label each weight and bias in our network:

Then, we can write loss as a multivariable function: $$L(w_1,w_2,w_3,w_4,w_5,w_6,b_1,b_2,b_3)$$

Imagine we wanted to tweak $w_1$. How would the loss $L$ change if we changed $w_1$? That's a question the partial derivative $\frac{\partial L}{\partial w_1}$ can answer. How do we calculate it?

To start, let's rewrite the partial derivative in terms of $\frac{\partial y_{pred}}{\partial w_1}$ instead: $$\dfrac{\partial L}{\partial w_1}= \dfrac{\partial L}{\partial y_{pred}}*\dfrac{\partial y_{pred}}{\partial w_1} $$

We can calculate $\frac{\partial L}{\partial y_{pred}}$ because we computed $L = (1 - y_{pred})^2$ above:

$$\dfrac{\partial L}{\partial y_{pred}} = \dfrac{\partial (1 - y_{pred})^2}{\partial y_{pred}}= -2(1-y_{pred})$$

Now, let's figure out what to do with $\frac{\partial y_{pred}}{\partial w_1}$. Just like before, let $h_1, h_2, o_1$ be the outputs of the neurons they represent. Then

$$ y_{pred}=o_1=f(w_5*h_1+w_6*h2+b_3)$$

Since $w_1$ only affects $h_1$ (not $h_2$), we can write

$$\dfrac{\partial y_{pred}}{\partial w_1} =\dfrac{\partial y_{pred}}{\partial h_1} *\dfrac{\partial h_1}{\partial w_1} $$

Also note that, by the chain rule,

$$ \dfrac{\partial y_{pred}}{\partial h_1} = w_5*f'(w_5h_1+w_6h_2+b_3)$$

Recall that $h_1 = f(w_1x_1+w_2x_2+b_1)$, so we can do the same thing for $\frac{\partial h_1}{\partial w_1}$:

$$ \dfrac{\partial h_1}{\partial w_1} = x_1*f'(w_1x_1+w_2x_2+b_1)$$

Here $x_1$ is weight and $x_2$ is height. This is the second time we've seen $f'(x)$ (the derivative of the sigmoid function) now! Let's derive it:

$$ f(x) = \dfrac{1}{1+e^{-x}}$$

By taking derivative, we get $$f'(x)= \dfrac{e^{-x}}{(1 + e^{-x})^2}=f(x) * (1 - f(x))$$

We'll use this nice form for $f'(x)$ later: it lets us evaluate the derivative directly from the sigmoid's own output $f(x)$, without differentiating again.
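A quick numerical spot-check of this identity (a sketch, not part of the notebook's code):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sigmoid_prime(x):
    # f'(x) = f(x) * (1 - f(x)): computed from the sigmoid's own output.
    fx = sigmoid(x)
    return fx * (1 - fx)

# Compare against the direct formula e^{-x} / (1 + e^{-x})^2 at a few points.
for x in (-2.0, 0.0, 0.5, 3.0):
    direct = math.exp(-x) / (1 + math.exp(-x)) ** 2
    print(x, sigmoid_prime(x) - direct)  # differences are ~0
```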

We're done! We've broken $\frac{\partial L}{\partial w_1}$ down into several parts we can calculate: $$\dfrac{\partial L}{\partial w_1} = \dfrac{\partial L}{\partial y_{pred}}*\dfrac{\partial y_{pred}}{\partial h_1}*\dfrac{\partial h_1}{\partial w_1} $$

This system of calculating partial derivatives by working backwards is known as backpropagation, or "backprop".

Example: Calculating the Partial Derivative

We're going to continue pretending only Alice is in our dataset:

| Name | Weight (minus 141) | Height (minus 68) | Gender |
| --- | --- | --- | --- |
| Alice | -9 | -3 | 1 |

Let's initialize all the weights to 1 and all the biases to 0. If we do a feedforward pass through the network, we get:

$$ h_1 =f(w_1*x_1+w_2*x_2+b_1)=f(−9+−3+0)=6.16*10^{-6}$$

and similarly

$$h_2 =f(w_3*x_1+w_4*x_2+b_2)=f(−9+−3+0)=6.16*10^{-6} $$

Now let us calculate $o_1$:

$$o_1 =f(w_5*h_1+w_6*h_2+b_3)=f(6.16*10^{-6}+6.16*10^{-6}+0)=0.50$$

The network outputs $y_{pred} = 0.50$, which doesn't favor Male (0) or Female (1). That makes sense, since we haven't done any training yet.

Let's calculate $\frac{\partial L}{\partial w_1}$:

\begin{aligned} \dfrac{\partial L}{\partial w_1} =& \dfrac{\partial L}{\partial y_{pred}}*\dfrac{\partial y_{pred}}{\partial h_1}*\dfrac{\partial h_1}{\partial w_1}\\ \end{aligned}

Now let us calculate each of the terms on the RHS one by one. \begin{aligned} \dfrac{\partial L}{\partial y_{pred}} &= -2(1 - y_{pred}) \\ &= -2(1 - 0.50) \\ &= -1 \\ \end{aligned} and \begin{aligned} \dfrac{\partial y_{pred}}{\partial h_1} &= w_5 * f'(w_5h_1 + w_6h_2 + b_3) \\ &= 1 * f'(6.16* 10^{-6} + 6.16* 10^{-6}+ 0) \\ &= f(1.23* 10^{-5}) * (1 - f(1.23* 10^{-5})) \\ &= 0.249 \\ \end{aligned} lastly \begin{aligned} \dfrac{\partial h_1}{\partial w_1} &= x_1 * f'(w_1x_1 + w_2x_2 + b_1) \\ &= -9 * f'(-9 + -3 + 0) \\ &= -9 * f(-12) * (1 - f(-12)) \\ &= -5.52* 10^{-5} \\ \end{aligned} Now, we can collect them all and write \begin{aligned} \dfrac{\partial L}{\partial w_1} &= -1 * 0.249 * -5.52* 10^{-5} \\ &= \boxed{1.37* 10^{-5}} \\ \end{aligned}

We did it! This tells us that if we were to increase $w_1$, $L$ would increase a tiny bit as a result.
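We can check this arithmetic with a short script (the tiny difference from $1.37*10^{-5}$ comes from the text's intermediate rounding of $0.25$ to $0.249$):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def deriv_sigmoid(x):
    fx = sigmoid(x)
    return fx * (1 - fx)

# Alice's shifted data and the all-ones initialization from the text.
x1, x2, y_true = -9, -3, 1
w1 = w2 = w3 = w4 = w5 = w6 = 1.0
b1 = b2 = b3 = 0.0

h1 = sigmoid(w1 * x1 + w2 * x2 + b1)
h2 = sigmoid(w3 * x1 + w4 * x2 + b2)
sum_o1 = w5 * h1 + w6 * h2 + b3
y_pred = sigmoid(sum_o1)          # ~0.50, as computed above

# Chain rule: dL/dw1 = dL/dy_pred * dy_pred/dh1 * dh1/dw1
d_L_d_ypred = -2 * (y_true - y_pred)
d_ypred_d_h1 = w5 * deriv_sigmoid(sum_o1)
d_h1_d_w1 = x1 * deriv_sigmoid(w1 * x1 + w2 * x2 + b1)

grad = d_L_d_ypred * d_ypred_d_h1 * d_h1_d_w1
print(grad)  # ≈ 1.38e-05, matching the hand calculation up to rounding
```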

Training: Stochastic Gradient Descent

We have all the tools we need to train a neural network now! We'll use an optimization algorithm called stochastic gradient descent (SGD) that tells us how to change our weights and biases to minimize loss. It's basically just this update equation:

$$ w_1\leftarrow w_1-\eta \dfrac{\partial L}{\partial w_1}$$

$\eta$ is a constant called the learning rate that controls how fast we train. All we're doing is subtracting $\eta \frac{\partial L}{\partial w_1}$ from $w_1$:

  • If $\frac{\partial L}{\partial w_1}$ is positive, $w_1$ will decrease, which makes $L$ decrease.
  • If $\frac{\partial L}{\partial w_1}$ is negative, $w_1$ will increase, which makes $L$ decrease.

If we do this for every weight and bias in the network, the loss will slowly decrease and our network will improve.

Our training process will look like this:

  1. Choose one sample from our dataset. This is what makes it stochastic gradient descent - we only operate on one sample at a time.
  2. Calculate all the partial derivatives of loss with respect to weights or biases (e.g. $\frac{\partial L}{\partial w_1}$,$\frac{\partial L}{\partial w_2}$, etc).
  3. Use the update equation to update each weight and bias.
  4. Go back to step 1.
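Putting backpropagation and the SGD update together, the whole training loop for this network can be sketched as follows (a minimal pure-Python version with a hypothetical random initialization, in the spirit of the derivation above):

```python
import math
import random

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def deriv_sigmoid(x):
    fx = sigmoid(x)
    return fx * (1 - fx)

# Shifted dataset from the text: [weight - 141, height - 68] -> gender (F=1, M=0).
data = [([-9, -3], 1), ([19, 4], 0), ([11, 7], 0), ([-21, -8], 1)]

random.seed(0)
w = [random.gauss(0, 1) for _ in range(6)]  # w1..w6
b = [random.gauss(0, 1) for _ in range(3)]  # b1..b3
eta = 0.1  # learning rate

def forward(x):
    sum_h1 = w[0] * x[0] + w[1] * x[1] + b[0]
    sum_h2 = w[2] * x[0] + w[3] * x[1] + b[1]
    h1, h2 = sigmoid(sum_h1), sigmoid(sum_h2)
    sum_o1 = w[4] * h1 + w[5] * h2 + b[2]
    return sum_h1, sum_h2, h1, h2, sum_o1, sigmoid(sum_o1)

for epoch in range(1000):
    for x, y_true in data:  # one sample at a time: that's the "stochastic" part
        sum_h1, sum_h2, h1, h2, sum_o1, y_pred = forward(x)

        # Backprop: partial derivatives of L w.r.t. every weight and bias.
        d_L = -2 * (y_true - y_pred)
        d_o = deriv_sigmoid(sum_o1)
        d_h1, d_h2 = deriv_sigmoid(sum_h1), deriv_sigmoid(sum_h2)

        grads_w = [d_L * d_o * w[4] * d_h1 * x[0],
                   d_L * d_o * w[4] * d_h1 * x[1],
                   d_L * d_o * w[5] * d_h2 * x[0],
                   d_L * d_o * w[5] * d_h2 * x[1],
                   d_L * d_o * h1,
                   d_L * d_o * h2]
        grads_b = [d_L * d_o * w[4] * d_h1,
                   d_L * d_o * w[5] * d_h2,
                   d_L * d_o]

        # SGD update: w <- w - eta * dL/dw
        w = [wi - eta * g for wi, g in zip(w, grads_w)]
        b = [bi - eta * g for bi, g in zip(b, grads_b)]

mse = sum((y - forward(x)[-1]) ** 2 for x, y in data) / len(data)
print(round(mse, 4))  # loss after training; far below the untrained ~0.25
```

After 1000 epochs the loss drops close to zero: the network has learned to separate the two genders in this toy dataset.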

Applying Machine Learning to predict DECISION using RATING

Splitting Data Set

We split the data into two parts: a training set and a test set. We train our algorithms on the training set, using cross-validation, and then evaluate them on the held-out test set.

Scaling

This is a crucial step: we rescale the input data so that every feature has zero mean and unit variance.

First, we will estimate DECISION using the RATING variable. Afterwards, we will predict the RATING variable itself; when we do, we will not use the decision variable.
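The split-and-scale steps can be sketched with scikit-learn as follows (synthetic stand-in data, since the notebook's data frame isn't reproduced here; note the scaler is fit on the training set only, to avoid information leakage):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))       # stand-in for the application features
y = rng.integers(0, 3, size=100)    # stand-in for DECISION (3 classes)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # learn mean/variance on train only
X_test = scaler.transform(X_test)        # apply the same transform to test

print(X_train.mean(axis=0).round(6))  # ~0 per feature
print(X_train.std(axis=0).round(6))   # ~1 per feature
```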

Boosting

Boosting (originally called hypothesis boosting) refers to any Ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor. There are many boosting methods available, but by far the most popular are AdaBoost (short for Adaptive Boosting), Gradient Boosting, and XGBoost. Let's start with AdaBoost.

Adaptive Boosting

One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfitted. This results in new predictors focusing more and more on the hard cases. This is the technique used by AdaBoost. For example, to build an AdaBoost classifier, a first base classifier (such as a Decision Tree) is trained and used to make predictions on the training set. The relative weight of misclassified training instances is then increased. A second classifier is trained using the updated weights and again makes predictions on the training set, the weights are updated, and so on. The next figure explains the structure.
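A minimal AdaBoost sketch with scikit-learn on synthetic data (not the notebook's dataset; the default base learner is a depth-1 decision tree, i.e. a stump, and the hyperparameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic binary classification data as a stand-in for the applications data.
X, y = make_classification(n_samples=300, random_state=0)

# Each successive stump is trained on instance weights boosted by its
# predecessor's misclassifications.
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
ada.fit(X, y)
print(round(ada.score(X, y), 3))  # training accuracy
```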

Decision Tree

Decision Trees are also the fundamental components of Random Forests which are among the most powerful Machine Learning algorithms available today. To understand Decision Trees, let’s just visualize one and take a look at how it makes predictions.

Let's see how the tree represented in the figure above makes predictions on the iris dataset. Suppose you find an iris flower and you want to classify it. You start at the root node (depth 0, at the top): this node asks whether the flower's petal length is smaller than 2.45 cm. If it is, then you move down to the root's left child node (depth 1, left). In this case, it is a leaf node (i.e., it does not have any child nodes), so it does not ask any questions: you simply look at the predicted class for that node, and the Decision Tree predicts that your flower is an Iris-Setosa (class=setosa).

Now suppose you find another flower, but this time the petal length is greater than 2.45 cm. You must move down to the root’s right child node (depth 1, right), which is not a leaf node, so it asks another question: is the petal width smaller than 1.75 cm? If it is, then your flower is most likely an Iris-Versicolor (depth 2, left). If not, it is likely an Iris-Virginica (depth 2, right). It’s really that simple.

The structure will be very similar in our case; however, the decision tree will be large because of the number of variables. We show one decision tree below to give an idea.

(Figure: the fitted decision tree, rendered from graphviz output. The root node splits on RATING <= 0.065 (gini = 0.593, 607 samples); deeper splits use RATING, TEST_WRITE, GRE_VERB, R_AVG_KNOWLEDGE, GPA, R_AVG_WRITTEN, R_AVG_RATING, MAJOR_4, and Emphasis Area 2_4.)

Gaussian Naive Bayes

In machine learning, naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naïve) independence assumptions between the features. They are among the simplest Bayesian network models.

Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. There is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features.

K Nearest Neighbor

In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:

  • In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.

  • In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.
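A toy illustration of the classification case with scikit-learn (hypothetical one-dimensional data): each query point is assigned the majority class among its 3 nearest training points.

```python
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated clusters on the number line.
X = [[1], [2], [3], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# 2.5 is nearest to {2, 3, 1} -> class 0; 10.5 is nearest to {10, 11, 12} -> class 1.
print(knn.predict([[2.5], [10.5]]))  # [0 1]
```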

Support Vector Machine

A Support Vector Machine (SVM) is a very powerful and versatile Machine Learning model, capable of performing linear or nonlinear classification, regression, and even outlier detection. It is one of the most popular models in Machine Learning, and anyone interested in Machine Learning should have it in their toolbox. SVMs are particularly well suited for classification of complex but small- or medium-sized datasets.

The fundamental idea behind SVMs is best explained with some pictures. The figure below shows part of the iris dataset that was introduced before. The two classes can clearly be separated easily with a straight line (they are linearly separable). The left plot shows the decision boundaries of three possible linear classifiers. The model whose decision boundary is represented by the dashed line is so bad that it does not even separate the classes properly. The other two models work perfectly on this training set, but their decision boundaries come so close to the instances that these models will probably not perform as well on new instances. In contrast, the solid line in the plot on the right represents the decision boundary of an SVM classifier; this line not only separates the two classes but also stays as far away from the closest training instances as possible. You can think of an SVM classifier as fitting the widest possible street (represented by the parallel dashed lines) between the classes.

Logistic Regression

Logistic Regression (also called Logit Regression) is commonly used to estimate the probability that an instance belongs to a particular class (e.g., what is the probability that this email is spam?). If the estimated probability is greater than 50%, then the model predicts that the instance belongs to that class (called the positive class, labeled “1”), or else it predicts that it does not (i.e., it belongs to the negative class, labeled “0”). This makes it a binary classifier.

Random Forest

A Random Forest is an ensemble of Decision Trees, generally trained via the bagging method (or sometimes pasting), typically with maximum samples set to the size of the training set.

The Random Forest algorithm introduces extra randomness when growing trees; instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features. This results in a greater tree diversity, which (once again) trades a higher bias for a lower variance, generally yielding an overall better model.

When you are growing a tree in a Random Forest, at each node only a random subset of the features is considered for splitting. It is possible to make trees even more random by also using random thresholds for each feature rather than searching for the best possible thresholds (like regular Decision Trees do).

Perceptron

The Perceptron is one of the simplest artificial neural network architectures, invented in 1957 by Frank Rosenblatt. It is based on a slightly different artificial neuron (see figure below) called a linear threshold unit (LTU): the inputs and output are now numbers (instead of binary on/off values) and each input connection is associated with a weight. The LTU computes a weighted sum of its inputs, $z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = w^T \cdot x$, then applies a step function to that sum and outputs the result: $h_w(x) = \mathrm{step}(z) = \mathrm{step}(w^T \cdot x)$.

The most common step function used in Perceptrons is the Heaviside step function.

A single LTU can be used for simple linear binary classification. It computes a linear combination of the inputs and if the result exceeds a threshold, it outputs the positive class or else outputs the negative class (just like a Logistic Regression classifier or a linear SVM).
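An LTU is simple enough to write out directly. Here is a sketch in which a hand-picked weight vector and threshold make the unit compute logical AND (illustrative values, not a trained perceptron):

```python
def step(z):
    # Heaviside step function: 1 if z >= 0, else 0.
    return 1 if z >= 0 else 0

def ltu(x, w, bias=0.0):
    # Linear threshold unit: weighted sum followed by a step function.
    z = sum(wi * xi for wi, xi in zip(w, x)) + bias
    return step(z)

# Weights (1, 1) with bias -1.5: the sum exceeds the threshold only
# when both inputs are 1, so the LTU implements AND.
w, bias = [1, 1], -1.5
print([ltu([a, b], w, bias) for a in (0, 1) for b in (0, 1)])  # [0, 0, 0, 1]
```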

Stochastic Gradient Descent

We have already mentioned how this works. We will apply this model to our train set.

XG Boost Classifier

XGBoost is a decision-tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework. In prediction problems involving unstructured data (images, text, etc.) artificial neural networks tend to outperform all other algorithms or frameworks. However, when it comes to small-to-medium structured/tabular data, decision tree based algorithms are considered best-in-class right now.

Which is the best model?

| Model | Score |
| --- | --- |
| Adaptive Boosting Classifier | 0.710526 |
| XG Boost Classifier | 0.697368 |
| Random Forest | 0.631579 |
| Stochastic Gradient Descent | 0.625000 |
| Logistic Regression | 0.598684 |
| Decision Tree | 0.585526 |
| Perceptron | 0.572368 |
| Support Vector Machines | 0.565789 |
| KNN | 0.506579 |
| Naive Bayes | 0.203947 |

Models with Grid search

One way to do that would be to fiddle with the hyperparameters manually, until you find a great combination of hyperparameter values. This would be very tedious work, and you may not have time to explore many combinations. Instead you should get Scikit-Learn’s GridSearchCV to search for you. All you need to do is tell it which hyperparameters you want it to experiment with, and what values to try out, and it will evaluate all the possible combinations of hyperparameter values, using cross-validation.
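The pattern looks like this (a sketch with synthetic data and a hypothetical parameter grid, not the exact grids used below):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Every combination in the grid is evaluated with 5-fold cross-validation.
param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)            # best combination found
print(round(search.best_score_, 3))   # its mean cross-validated accuracy
```

After fitting, `search.best_estimator_` is a tree refit on the whole training set with the winning hyperparameters, ready to be evaluated on the test set.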

Ada Boost

Decision Tree

KNN

Light Gradient Boosting

Linear SVM

Logistic Regression

Random Forest

SGD

Support Vector Machines

XGB Classifier

Which one is the best model with GridSearch?

| Model | Score |
| --- | --- |
| Stochastic Gradient Descent | 0.743421 |
| Adaptive Boosting Classifier | 0.710526 |
| Decision Tree | 0.703947 |
| Random Forest | 0.697368 |
| XG Boost Classifier | 0.697368 |
| Light GBM | 0.684211 |
| Linear Support Vector Machines | 0.644737 |
| Support Vector Machines | 0.625000 |
| Logistic Regression | 0.592105 |
| KNN | 0.513158 |

Let us combine these two tables.

| Model | Score | Score with grid search |
| --- | --- | --- |
| Adaptive Boosting Classifier | 0.710526 | 0.710526 |
| XG Boost Classifier | 0.697368 | 0.697368 |
| Random Forest | 0.631579 | 0.697368 |
| Stochastic Gradient Descent | 0.625000 | 0.743421 |
| Logistic Regression | 0.598684 | 0.592105 |
| Decision Tree | 0.585526 | 0.703947 |
| Perceptron | 0.572368 | NaN |
| Support Vector Machines | 0.565789 | 0.625000 |
| KNN | 0.506579 | 0.513158 |
| Naive Bayes | 0.203947 | NaN |
| Light GBM | NaN | 0.684211 |
| Linear Support Vector Machines | NaN | 0.644737 |

Stacking Approach to predict Decision

The Ensemble method we will discuss here is called stacking (short for stacked generalization). It is based on a simple idea: instead of using trivial functions (such as hard voting) to aggregate the predictions of all predictors in an ensemble, why don't we train a model to perform this aggregation? The figure below shows such an ensemble performing a regression task on a new instance. Each of the bottom three predictors predicts a different value (3.1, 2.7, and 2.9), and then the final predictor (called a blender, or a meta learner) takes these predictions as inputs and makes the final prediction (3.0).

To train the blender, a common approach is to use a hold-out set. Let’s see how it works. First, the training set is split in two subsets. The first subset is used to train the predictors in the first layer in the figure below.

Next, the first layer predictors are used to make predictions on the second (held-out) set (see the figure below). This ensures that the predictions are “clean,” since the predictors never saw these instances during training. Now for each instance in the hold-out set there are three predicted values. We can create a new training set using these predicted values as input features (which makes this new training set three-dimensional), and keeping the target values. The blender is trained on this new training set, so it learns to predict the target value given the first layer’s predictions.

It is actually possible to train several different blenders this way (e.g., one using Linear Regression, another using Random Forest Regression, and so on): we get a whole layer of blenders. The trick is to split the training set into three subsets: the first one is used to train the first layer, the second one is used to create the training set used to train the second layer (using predictions made by the predictors of the first layer), and the third one is used to create the training set to train the third layer (using predictions made by the predictors of the second layer). Once this is done, we can make a prediction for a new instance by going through each layer sequentially, as shown in the figure.

We use the few models that gave the best accuracy on the test set with grid search as our base models. We aggregate these models into a new, stacked model. Below, we train these models on the training set using cross-validation.

Average accuracy on a train set: 0.6755 (+/- 0.017) [ADA Boost]
Average accuracy on a train set: 0.6888 (+/- 0.027) [Decision tree]
Average accuracy on a train set: 0.6491 (+/- 0.027) [Light GBM]
Average accuracy on a train set: 0.6639 (+/- 0.028) [Random Forest]
Average accuracy on a train set: 0.7052 (+/- 0.025) [SGD]
Average accuracy on a train set: 0.6737 (+/- 0.045) [XGB]
Average accuracy on a train set: 0.6474 (+/- 0.034) [StackingClassifier]

Now, we will apply these models to the whole training set. We will build a new data frame containing each model's predictions and take the most common value in each row to obtain a single prediction column for the decision variable.

Average accuracy on a train set: 0.8402

We will apply this approach to the test set.
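The row-wise majority vote described above can be sketched with pandas (hypothetical prediction columns, purely for illustration):

```python
import pandas as pd

# Each column holds one base model's predictions for four applicants.
preds = pd.DataFrame({
    "ada":  ["Reject", "Admit",    "Waitlist", "Reject"],
    "tree": ["Reject", "Waitlist", "Waitlist", "Admit"],
    "sgd":  ["Reject", "Admit",    "Admit",    "Admit"],
})

# mode(axis=1) returns the most common value in each row; take the first
# column in case of ties.
final = preds.mode(axis=1)[0]
print(final.tolist())  # ['Reject', 'Admit', 'Waitlist', 'Admit']
```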

The last method we will use to predict the decision variable is the h2o AutoML package.

Using h2o AutoML package to predict Decision

H2O’s AutoML can be used for automating the machine learning workflow, which includes automatic training and tuning of many models within a user-specified time-limit. Stacked Ensembles – one based on all previously trained models, another one on the best model of each family – will be automatically trained on collections of individual models to produce highly predictive ensemble models which, in most cases, will be the top performing models in the AutoML Leaderboard.

The H2O AutoML interface is designed to have as few parameters as possible so that all the user needs to do is point to their dataset, identify the response column and optionally specify a time constraint or limit on the number of total models trained.

We will be using the same data frame as above; however, we need to convert the pandas data frame into an h2o data frame in order to use the h2o packages.

Now we will apply the h2o AutoML package to predict the decision variable on the training set. We will look at the ten models that give the best predictions on the training set, then pick the best of them to try on the test set.

AutoML progress: |████████████████████████████████████████████████████████| 100%
| model_id | mean_per_class_error | logloss | rmse | mse |
| --- | --- | --- | --- | --- |
| GBM_grid_1_AutoML_20191201_111358_model_2 | 0.407229 | 1.31798 | 0.530878 | 0.281831 |
| XGBoost_grid_1_AutoML_20191201_111358_model_1 | 0.417118 | 0.725838 | 0.50669 | 0.256735 |
| XGBoost_grid_1_AutoML_20191201_111358_model_4 | 0.418664 | 0.7241 | 0.503248 | 0.253259 |
| XGBoost_grid_1_AutoML_20191201_111358_model_2 | 0.41918 | 0.707062 | 0.497744 | 0.247749 |
| XGBoost_2_AutoML_20191201_111358 | 0.427726 | 0.726326 | 0.507059 | 0.257109 |
| StackedEnsemble_BestOfFamily_AutoML_20191201_111358 | 0.432357 | 0.708852 | 0.495034 | 0.245059 |
| DeepLearning_grid_1_AutoML_20191201_111358_model_1 | 0.433148 | 2.0321 | 0.564168 | 0.318286 |
| XGBoost_grid_1_AutoML_20191201_111358_model_7 | 0.433316 | 0.74956 | 0.518527 | 0.26887 |
| XGBoost_grid_1_AutoML_20191201_111358_model_3 | 0.433455 | 0.745136 | 0.512005 | 0.262149 |
| XGBoost_1_AutoML_20191201_111358 | 0.436663 | 0.7117 | 0.496419 | 0.246431 |

This shows that the GBM grid model is the best one. Let us see the performance of this model on the test set, including the probabilities with which the model predicts the different levels of the variable.

gbm prediction progress: |████████████████████████████████████████████████| 100%
Rows:152
Cols:4


| | predict | Admit | Reject | Waitlist |
| --- | --- | --- | --- | --- |
| type | enum | real | real | real |
| mins | | 5.209438295227014e-07 | 7.788663860572825e-05 | 4.030894069031509e-06 |
| mean | | 0.0894543802504499 | 0.5430801441875348 | 0.3674654755620154 |
| maxs | | 0.9989826928502821 | 0.9999940195863029 | 0.9998790114674498 |
| sigma | | 0.23210418898047652 | 0.4541262415953114 | 0.41652984678224486 |
| zeros | | 0 | 0 | 0 |
| missing | 0 | 0 | 0 | 0 |
| 0 | Waitlist | 0.05679192739664471 | 0.4480436880052786 | 0.4951643845980767 |
| 1 | Waitlist | 0.03619576290528703 | 0.06348535268099545 | 0.9003188844137174 |
| 2 | Reject | 0.007500718061633539 | 0.7485192076192061 | 0.2439800743191604 |
| 3 | Reject | 3.618287029661471e-06 | 0.9997149406330772 | 0.00028144107989308824 |
| 4 | Reject | 7.983239844384858e-05 | 0.9989953284207064 | 0.0009248391808497036 |
| 5 | Waitlist | 0.007682258284996146 | 0.4039500282703496 | 0.5883677134446543 |
| 6 | Waitlist | 0.0034006394350013223 | 0.16439528560115582 | 0.8322040749638427 |
| 7 | Waitlist | 0.3277703961710042 | 0.0005130472390521145 | 0.6717165565899437 |
| 8 | Reject | 0.001140537974717637 | 0.8326599185445787 | 0.16619954348070373 |
| 9 | Waitlist | 0.061813918816480816 | 0.0070155513926360834 | 0.9311705297908832 |

The overall performance and confusion matrix are shown below. Surprisingly, the overall accuracy is about 70 percent (0.7039), slightly worse than SGD.

performance = aml_first.leader.model_performance(test)
performance.show()
ModelMetricsMultinomial: gbm
** Reported on test data. **

MSE: 0.27305611997075074
RMSE: 0.5225477202808857
LogLoss: 1.365883168606446
Mean Per-Class Error: 0.3831297009722987

Confusion Matrix: Row labels: Actual class; Column labels: Predicted class

| | Admit | Reject | Waitlist | Error | Rate |
| --- | --- | --- | --- | --- | --- |
| Admit | 8 | 3 | 12 | 0.652174 | 15 / 23 |
| Reject | 2 | 65 | 12 | 0.177215 | 14 / 79 |
| Waitlist | 3 | 13 | 34 | 0.320000 | 16 / 50 |
| Total | 13 | 81 | 58 | 0.296053 | 45 / 152 |

Top-3 Hit Ratios:

| k | hit_ratio |
| --- | --- |
| 1 | 0.703947 |
| 2 | 0.907895 |
| 3 | 1.000000 |

Conclusion about predicting the decision variable

The higher the accuracy, the better. The accuracy table below shows that Stochastic Gradient Descent with grid search is the best model for predicting the decision variable using the rating variable.

| Model | Accuracy |
| --- | --- |
| Stochastic Gradient Descent | 0.743421 |
| Adaptive Boosting Classifier | 0.710526 |
| h2o AutoML | 0.703947 |
| Decision Tree | 0.703947 |
| Stacking (Mixed) Model | 0.703947 |
| Random Forest | 0.697368 |
| XG Boost Classifier | 0.697368 |
| Light GBM | 0.684211 |
| Linear Support Vector Machines | 0.644737 |
| Support Vector Machines | 0.625000 |
| Logistic Regression | 0.592105 |
| KNN | 0.513158 |

Estimating the RATING variable.

Individual Models

For the first part, we will estimate the rating variable with different individual models; then we will use a stacking approach. We will not use the decision variable to predict the rating variable, so we drop it.

Now, we will train all our models on the training set, tuning the hyperparameters of each algorithm.

Let us see the performance of each of these individual models on a train set.

Ridge Regression RMSE score: 0.8138 (0.0685)

LASSO RMSE score: 0.7934 (0.0695)

Elasticnet RMSE score: 0.8250 (0.0737)

TSR RMSE score: 0.8332 (0.0737)

Huber RMSE score: 0.8119 (0.0582)

Kernel Ridge Regression RMSE score: 1.0570 (0.0928)

SVR RMSE score: 0.8092 (0.0592)

Light GBM RMSE score: 0.8729 (0.0661)

SGD RMSE score: 0.8173 (0.0749)

Linear Regression RMSE score: 0.8311 (0.0756)

Decision Tree RMSE score: 1.1643 (0.1081)

Random Forest RMSE score: 0.8624 (0.0719)

Gradient Boosting RMSE score: 0.9243 (0.0659)

XG Boost RMSE score: 0.8785 (0.0583)

Since the RATING variable is numeric, we use RMSE to measure how well each model is doing: the lower, the better. We can put the scores into a data frame to compare them.

| Model | Score |
| --- | --- |
| LASSO Regression | 0.793411 |
| Epsilon-Support Vector Regression | 0.809163 |
| Huber Regressor | 0.811865 |
| Ridge Regression | 0.813751 |
| SGD | 0.817268 |
| Elastic Net | 0.825020 |
| Linear Regression | 0.831060 |
| Theil-Sen Regressor | 0.833203 |
| Random Forest Regressor | 0.862422 |
| Light GBM | 0.872895 |
| XGBoost Regressor | 0.878517 |
| Gradient Boosting | 0.924299 |
| Kernel Ridge Regression | 1.056997 |
| Decision Tree Regressor | 1.164290 |

Stacking approach (Blending)

Now, let us train all these models as well as stacking one on a whole train set.

Now it is time to mix all the models. The coefficients in front of the models were chosen by hand, with larger weights assigned to the better-performing models.

def mixed_models_predict(X):
    return (
            (0.01 * xgb_model_full_data.predict(X)) + \
            (0.01 * lgb_model_full_data.predict(X)) + \
            (0.05 * rf_model_full_data.predict(X)) + \
            (0.05 * tsr_model_full_data.predict(X)) + \
            (0.05 * lin_model_full_data.predict(X)) + \
            (0.05 * elastic_model_full_data.predict(X)) + \
            (0.15 * lasso_model_full_data.predict(X)) + \
            (0.05 * sgd_model_full_data.predict(X)) + \
            (0.1 * ridge_model_full_data.predict(X)) + \
            (0.1 * huber_model_full_data.predict(X)) + \
            (0.15 * svr_model_full_data.predict(X)) + \
            (0.29 * stack_gen_model.predict(np.array(X))))

Now we can try our mixed model both on a train set and test set.

RMSE score on train data:
0.7914288623274237
RMSE score on test data:
0.8117989462368014

This shows that the RMSE score of the stacking approach on the test data is slightly worse than that of the best individual model. Next, we will use h2o AutoML to predict the rating variable.

Now, we would like to use h2o AutoML package to predict rating and to compare the result with our stacking approach result.

Using h2o AutoML to predict the RATING

We will see which model does better than the others at predicting the rating variable. The table below shows the top ten models.

AutoML progress: |████████████████████████████████████████████████████████| 100%
| model_id | mean_residual_deviance | rmse | mse | mae | rmsle |
| --- | --- | --- | --- | --- | --- |
| GLM_grid_1_AutoML_20191201_112423_model_1 | 0.594093 | 0.770774 | 0.594093 | 0.616369 | 0.179042 |
| GBM_grid_1_AutoML_20191201_112423_model_5 | 0.596092 | 0.77207 | 0.596092 | 0.617405 | 0.179331 |
| GBM_grid_1_AutoML_20191201_112423_model_4 | 0.59661 | 0.772406 | 0.59661 | 0.617228 | 0.179354 |
| GBM_grid_1_AutoML_20191201_112423_model_2 | 0.598395 | 0.77356 | 0.598395 | 0.617919 | 0.17965 |
| GBM_grid_1_AutoML_20191201_112423_model_1 | 0.598524 | 0.773644 | 0.598524 | 0.619354 | 0.17958 |
| StackedEnsemble_BestOfFamily_AutoML_20191201_112423 | 0.599104 | 0.774018 | 0.599104 | 0.617448 | 0.179725 |
| StackedEnsemble_AllModels_AutoML_20191201_112423 | 0.601691 | 0.775688 | 0.601691 | 0.618623 | 0.18008 |
| XGBoost_grid_1_AutoML_20191201_112423_model_6 | 0.608127 | 0.779825 | 0.608127 | 0.629748 | 0.179725 |
| GBM_5_AutoML_20191201_112423 | 0.610199 | 0.781153 | 0.610199 | 0.625302 | 0.180999 |
| XGBoost_grid_1_AutoML_20191201_112423_model_7 | 0.631104 | 0.794421 | 0.631104 | 0.644246 | 0.182827 |

Let us see the performance of the best model on the test set and compare its RMSE score with our stacking approach.

ModelMetricsRegressionGLM: glm
** Reported on test data. **

MSE: 0.7061227120640939
RMSE: 0.8403110805315457
MAE: 0.6512500190822342
RMSLE: 0.20689927999737506
R^2: -0.05415778048534392
Mean Residual Deviance: 0.7061227120640939
Null degrees of freedom: 157
Residual degrees of freedom: 94
Null deviance: 110.6498733711292
Residual deviance: 111.56738850612685
AIC: 523.4059100243061

Surprisingly, our stacking approach gives a better RMSE score than h2o AutoML.

Predicting RATING variable using other h2o models including Deep Learning

1. Using h2o Gradient Boosting Machine (GBM) to predict the Rating

The following tables show

  1. Model summary (number of trees, number of leaves, max and min depth, min and max number of leaves)
  2. MSE, RMSE, MAE and RMSLE scores on the training and validation data.
  3. Scoring history (how the RMSE decreases as the number of trees grows)
  4. Variable importance (MAJOR_3, GPA, AGE and R_AVG_KNOWLEDGE have higher importance than the other variables.)
Model Details
=============
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  gbm_grid1_model_27


Model Summary: 
number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
0 40.0 40.0 8624.0 5.0 5.0 5.0 8.0 16.0 12.55

ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 0.5518103907755587
RMSE: 0.7428394111620349
MAE: 0.5921554453670979
RMSLE: 0.1753199657510518
Mean Residual Deviance: 0.5518103907755587

ModelMetricsRegression: gbm
** Reported on validation data. **

MSE: 0.5288314561262777
RMSE: 0.7272079868416447
MAE: 0.5890066077891671
RMSLE: 0.16456833669740678
Mean Residual Deviance: 0.5288314561262777

Scoring History: 
timestamp duration number_of_trees training_rmse training_mae training_deviance validation_rmse validation_mae validation_deviance
0 2019-12-01 11:27:53 3.604 sec 0.0 0.786486 0.628663 0.618560 0.728273 0.592331 0.530382
1 2019-12-01 11:27:53 3.635 sec 5.0 0.780827 0.623733 0.609690 0.726937 0.591503 0.528438
2 2019-12-01 11:27:53 3.668 sec 10.0 0.774565 0.618349 0.599951 0.726292 0.590205 0.527500
3 2019-12-01 11:27:53 3.687 sec 15.0 0.770210 0.614570 0.593224 0.726165 0.589992 0.527316
4 2019-12-01 11:27:53 3.707 sec 20.0 0.764177 0.609492 0.583966 0.724791 0.588794 0.525322
5 2019-12-01 11:27:53 3.732 sec 25.0 0.758190 0.604573 0.574851 0.725716 0.589370 0.526664
6 2019-12-01 11:27:53 3.756 sec 30.0 0.752360 0.600157 0.566045 0.726778 0.589325 0.528207
7 2019-12-01 11:27:53 3.777 sec 35.0 0.748010 0.596589 0.559519 0.727005 0.589710 0.528536
8 2019-12-01 11:27:53 3.797 sec 40.0 0.742839 0.592155 0.551810 0.727208 0.589007 0.528831
Variable Importances: 
variable relative_importance scaled_importance percentage
0 MAJOR_3 166.577301 1.000000 0.115627
1 GPA 163.097412 0.979109 0.113212
2 AGE 146.658112 0.880421 0.101801
3 R_AVG_KNOWLEDGE 72.391663 0.434583 0.050250
4 Emphasis Area 2_2 58.542118 0.351441 0.040636
5 GRE_SUB 58.011650 0.348257 0.040268
6 GRE_QUANT 57.520313 0.345307 0.039927
7 GRE_VERB 54.470982 0.327001 0.037810
8 TEST_LISTEN 53.057537 0.318516 0.036829
9 R_AVG_ORAL 52.261028 0.313734 0.036276
10 R_AVG_MOT 52.131237 0.312955 0.036186
11 R_AVG_WRITTEN 39.846695 0.239208 0.027659
12 TEST_WRITE 37.386395 0.224439 0.025951
13 R_AVG_RES 34.760654 0.208676 0.024129
14 R_AVG_EMOT 27.030359 0.162269 0.018763
15 GRE_AW 26.256796 0.157625 0.018226
16 MAJOR_2 23.421410 0.140604 0.016258
17 Emphasis Area 3_1 20.035583 0.120278 0.013907
18 R_AVG_ACADEMIC 19.931606 0.119654 0.013835
19 Emphasis Area 2_4 18.787334 0.112784 0.013041
See the whole table with table.as_data_frame()
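h2o's variable-importance table has a close analogue in scikit-learn's feature_importances_ attribute. A small sketch on synthetic data (the predictors and response here are made up, not the application dataset; only the first predictor drives the target, so it should dominate the importances):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)

gbm = GradientBoostingRegressor(n_estimators=40, max_depth=5, random_state=0)
gbm.fit(X, y)

# Analogue of h2o's "Variable Importances" percentage column
# (feature_importances_ is already normalised to sum to 1).
for name, imp in zip(["x0", "x1", "x2"], gbm.feature_importances_):
    print(name, round(imp, 3))
```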

2. Using h2o Random Forest Algorithm to predict the Rating

The following tables show

  1. Model summary (number of trees, number of leaves, max and min depth, min and max number of leaves)
  2. MSE, RMSE, MAE and RMSLE scores on the validation and cross-validation data.
  3. Cross-Validation Metrics Summary (MSE, RMSE, MAE and RMSLE per fold, with their means and standard deviations)
  4. Scoring history (how the RMSE changes as the number of trees grows)
  5. Variable importance (R_AVG_KNOWLEDGE, AGE, GRE_AW and MAJOR_3 have higher importance than the other variables.)
Model Details
=============
H2ORandomForestEstimator :  Distributed Random Forest
Model Key:  rf_grid_model_15


Model Summary: 
number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
0 100.0 100.0 15296.0 3.0 3.0 3.0 5.0 8.0 7.53

ModelMetricsRegression: drf
** Reported on train data. **

MSE: NaN
RMSE: NaN
MAE: NaN
RMSLE: NaN
Mean Residual Deviance: NaN

ModelMetricsRegression: drf
** Reported on validation data. **

MSE: 0.5302211833272648
RMSE: 0.728162882415236
MAE: 0.5914844633558745
RMSLE: 0.1648031182919969
Mean Residual Deviance: 0.5302211833272648

ModelMetricsRegression: drf
** Reported on cross-validation data. **

MSE: 0.6152322769401325
RMSE: 0.7843674374552608
MAE: 0.6274030124911895
RMSLE: 0.18412820193751578
Mean Residual Deviance: 0.6152322769401325

Cross-Validation Metrics Summary: 
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid
0 mae 0.62720394 0.06420952 0.7100078 0.6245481 0.54046416 0.59823227 0.6627676
1 mean_residual_deviance 0.6150031 0.08675425 0.7103293 0.58585465 0.5304058 0.54408133 0.7043445
2 mse 0.6150031 0.08675425 0.7103293 0.58585465 0.5304058 0.54408133 0.7043445
3 r2 0.0010751348 0.006692049 -0.0041329777 -0.0069024027 0.0043207356 0.009815593 0.0022747258
4 residual_deviance 0.6150031 0.08675425 0.7103293 0.58585465 0.5304058 0.54408133 0.7043445
5 rmse 0.7826765 0.055007104 0.84281033 0.76541144 0.72828966 0.7376187 0.83925235
6 rmsle 0.18353629 0.01616218 0.19363096 0.1720045 0.1753295 0.1697018 0.2070147
Scoring History: 
timestamp duration number_of_trees training_rmse training_mae training_deviance validation_rmse validation_mae validation_deviance
0 2019-12-01 11:28:21 25.379 sec 0.0 NaN NaN NaN NaN NaN NaN
1 2019-12-01 11:28:21 25.381 sec 1.0 NaN NaN NaN 0.725674 0.587099 0.526603
2 2019-12-01 11:28:21 25.382 sec 2.0 NaN NaN NaN 0.722365 0.585057 0.521811
3 2019-12-01 11:28:21 25.384 sec 3.0 NaN NaN NaN 0.727312 0.584177 0.528983
4 2019-12-01 11:28:21 25.386 sec 4.0 NaN NaN NaN 0.726476 0.586191 0.527767
5 2019-12-01 11:28:21 25.388 sec 5.0 NaN NaN NaN 0.727803 0.587630 0.529697
6 2019-12-01 11:28:21 25.389 sec 6.0 NaN NaN NaN 0.726702 0.588862 0.528095
7 2019-12-01 11:28:21 25.391 sec 7.0 NaN NaN NaN 0.725737 0.586757 0.526694
8 2019-12-01 11:28:21 25.393 sec 8.0 NaN NaN NaN 0.727940 0.585781 0.529897
9 2019-12-01 11:28:21 25.395 sec 9.0 NaN NaN NaN 0.727257 0.587056 0.528903
10 2019-12-01 11:28:21 25.397 sec 10.0 NaN NaN NaN 0.725802 0.586985 0.526789
11 2019-12-01 11:28:21 25.399 sec 11.0 NaN NaN NaN 0.726807 0.589313 0.528248
12 2019-12-01 11:28:21 25.401 sec 12.0 NaN NaN NaN 0.727407 0.589359 0.529121
13 2019-12-01 11:28:21 25.403 sec 13.0 NaN NaN NaN 0.727288 0.589604 0.528947
14 2019-12-01 11:28:21 25.405 sec 14.0 NaN NaN NaN 0.728711 0.590567 0.531019
15 2019-12-01 11:28:21 25.407 sec 15.0 NaN NaN NaN 0.729490 0.592981 0.532155
16 2019-12-01 11:28:21 25.409 sec 16.0 NaN NaN NaN 0.728114 0.591346 0.530150
17 2019-12-01 11:28:21 25.411 sec 17.0 NaN NaN NaN 0.729279 0.592599 0.531848
18 2019-12-01 11:28:21 25.414 sec 18.0 NaN NaN NaN 0.729532 0.592808 0.532217
19 2019-12-01 11:28:21 25.416 sec 19.0 NaN NaN NaN 0.729177 0.592718 0.531699
See the whole table with table.as_data_frame()

Variable Importances: 
variable relative_importance scaled_importance percentage
0 R_AVG_KNOWLEDGE 76.530617 1.000000 0.071089
1 AGE 65.032707 0.849761 0.060409
2 GRE_AW 64.440887 0.842028 0.059859
3 MAJOR_3 54.048294 0.706231 0.050205
4 R_AVG_ACADEMIC 53.614399 0.700561 0.049802
5 TEST_WRITE 44.730049 0.584473 0.041550
6 R_AVG_MOT 40.118469 0.524215 0.037266
7 TEST_LISTEN 36.063126 0.471225 0.033499
8 TEST_READ 35.475906 0.463552 0.032954
9 CTZNSHP_6 34.768715 0.454311 0.032297
10 TEST_SPEAK 33.389084 0.436284 0.031015
11 GRE_VERB 32.155384 0.420164 0.029869
12 R_AVG_EMOT 30.912485 0.403923 0.028715
13 GPA 30.032822 0.392429 0.027898
14 CTZNSHP_2 29.715391 0.388281 0.027603
15 HAS_LANGUAGE_TEST_0 27.766647 0.362818 0.025792
16 MAJOR_2 26.204636 0.342407 0.024342
17 R_AVG_RATING 25.493433 0.333114 0.023681
18 R_AVG_RES 20.938150 0.273592 0.019449
19 MAJOR_5 19.935640 0.260492 0.018518
See the whole table with table.as_data_frame()
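The cross-validation metrics summary above can be reproduced outside h2o with scikit-learn's cross_val_score. A sketch on synthetic data (X, y and the model settings here are illustrative only):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = X[:, 0] + 0.5 * rng.normal(size=150)

rf = RandomForestRegressor(n_estimators=100, max_depth=3, random_state=0)
# 5-fold CV, mirroring h2o's cv_1_valid .. cv_5_valid columns.
neg_mse = cross_val_score(rf, X, y, cv=5, scoring="neg_mean_squared_error")
fold_rmse = np.sqrt(-neg_mse)
print("per-fold RMSE:", fold_rmse.round(3))
print("mean:", fold_rmse.mean().round(3), "sd:", fold_rmse.std(ddof=1).round(3))
```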

Using h2o Deep Learning Algorithms to predict the Rating

deeplearning Grid Build progress: |███████████████████████████████████████| 100%

See the Model performance

Identify the best model generated, i.e. the one with the least error

The following tables show

  1. Model summary (number of layers, number of units, activation function, dropout, metrics, bias and weights)
  2. MSE, RMSE, MAE and RMSLE scores on the training and validation data.
  3. Scoring history (how the RMSE changes over training epochs)
  4. Variable importance (hard to interpret here, since all the variables have nearly the same percentage.)
Model Details
=============
H2ODeepLearningEstimator :  Deep Learning
Model Key:  dl_grid_model_38


Status of Neuron Layers: predicting RATING, regression, quantile distribution, Quantile loss, 41,345 weights/biases, 504.7 KB, 91,520 training samples, mini-batch size 1
layer units type dropout l1 l2 mean_rate rate_rms momentum mean_weight weight_rms mean_bias bias_rms
0 1 63 Input 0
1 2 128 TanhDropout 50 0.1 0.1 0.00990701 0.0031688 0 1.2132e-05 0.000487845 -4.1063e-05 0.000359515
2 3 128 TanhDropout 50 0.1 0.1 0.0103747 0.00315022 0 3.79172e-06 0.000515146 -4.2416e-05 0.000685741
3 4 128 TanhDropout 50 0.1 0.1 0.0105346 0.00309609 0 2.13087e-06 0.000521932 -3.25226e-06 0.000622209
4 5 1 Linear 0.1 0.1 0.00805537 0.00223858 0 -1.36523e-05 0.000439935 0.0544721 1.09713e-154

ModelMetricsRegression: deeplearning
** Reported on train data. **

MSE: 0.6203994747649311
RMSE: 0.7876544132834723
MAE: 0.6220297517164883
RMSLE: 0.1856160304470037
Mean Residual Deviance: 0.31101487585824417

ModelMetricsRegression: deeplearning
** Reported on validation data. **

MSE: 0.529232347302667
RMSE: 0.7274835718438369
MAE: 0.5849015077094514
RMSLE: 0.16515763954982396
Mean Residual Deviance: 0.2924507538547257

Scoring History: 
timestamp duration training_speed epochs iterations samples training_rmse training_deviance training_mae training_r2 validation_rmse validation_deviance validation_mae validation_r2
0 2019-12-01 11:44:23 0.000 sec None 0.0 0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
1 2019-12-01 11:44:24 15 min 54.683 sec 5546 obs/sec 10.0 1 4160.0 0.787654 0.311015 0.622030 -0.002974 0.727484 0.292451 0.584902 -0.000122
2 2019-12-01 11:44:29 16 min 0.043 sec 5462 obs/sec 80.0 8 33280.0 0.786843 0.312217 0.624433 -0.000908 0.727524 0.293999 0.587999 -0.000235
3 2019-12-01 11:44:34 16 min 5.370 sec 5473 obs/sec 150.0 15 62400.0 0.786513 0.313734 0.627468 -0.000069 0.727990 0.295568 0.591135 -0.001514
4 2019-12-01 11:44:40 16 min 10.730 sec 5467 obs/sec 220.0 22 91520.0 0.787044 0.311703 0.623406 -0.001420 0.727458 0.293457 0.586914 -0.000051
5 2019-12-01 11:44:40 16 min 10.752 sec 5466 obs/sec 220.0 22 91520.0 0.787654 0.311015 0.622030 -0.002974 0.727484 0.292451 0.584902 -0.000122
Variable Importances: 
variable relative_importance scaled_importance percentage
0 Emphasis Area 3_2 1.000000 1.000000 0.017975
1 HAS_LANGUAGE_TEST_0 0.982987 0.982987 0.017669
2 R_AVG_WRITTEN 0.959405 0.959405 0.017246
3 UU_APPL_CITIZEN_1 0.955407 0.955407 0.017174
4 TEST_READ 0.955111 0.955111 0.017168
5 HAS_LANGUAGE_TEST_1 0.952730 0.952730 0.017126
6 LOW_INCOME_1 0.951864 0.951864 0.017110
7 TEST_SPEAK 0.950476 0.950476 0.017085
8 Emphasis Area 2_4 0.943268 0.943268 0.016956
9 Emphasis Area 2_3 0.933305 0.933305 0.016776
10 R_AVG_KNOWLEDGE 0.930153 0.930153 0.016720
11 R_AVG_ORAL 0.928679 0.928679 0.016693
12 UU_APPL_NTV_LANG_5 0.928102 0.928102 0.016683
13 R_AVG_MOT 0.927248 0.927248 0.016668
14 CTZNSHP_3 0.922518 0.922518 0.016583
15 GRE_AW 0.921495 0.921495 0.016564
16 MAJOR_1 0.920395 0.920395 0.016544
17 GRE_QUANT 0.916672 0.916672 0.016477
18 UU_APPL_NTV_LANG_4 0.912989 0.912989 0.016411
19 LOW_INCOME_2 0.909654 0.909654 0.016351
See the whole table with table.as_data_frame()
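For readers without an h2o cluster, a rough scikit-learn analogue of the 3 x 128-unit tanh network above can be sketched with MLPRegressor (MLPRegressor has no dropout, so that part of the h2o model is omitted; the data here are synthetic):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - X[:, 1] + 0.3 * rng.normal(size=200)

# Three hidden layers of 128 tanh units, echoing the h2o architecture above.
net = MLPRegressor(hidden_layer_sizes=(128, 128, 128), activation="tanh",
                   max_iter=500, random_state=0)
net.fit(X, y)
print("train RMSE:", np.sqrt(np.mean((net.predict(X) - y) ** 2)).round(3))
```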

Compare Model Performances

We will compare these three models (GBM, RF and Deep Learning) on a test set that was split off from the training set.

ModelMetricsRegression: gbm
** Reported on test data. **

MSE: 0.5459165845380103
RMSE: 0.7388616816008327
MAE: 0.5904200582511832
RMSLE: 0.16617257222569887
Mean Residual Deviance: 0.5459165845380103


ModelMetricsRegression: drf
** Reported on test data. **

MSE: 0.5466279437877091
RMSE: 0.7393429135304599
MAE: 0.5908274656404623
RMSLE: 0.16619937241217192
Mean Residual Deviance: 0.5466279437877091


ModelMetricsRegression: deeplearning
** Reported on test data. **

MSE: 0.5551326289092539
RMSE: 0.7450722306657616
MAE: 0.5946641517622181
RMSLE: 0.16819793979237863
Mean Residual Deviance: 0.29733207588110905

Predictions of the models on the original test data

Combining all the results for predicting the rating variable

Name of the model                  RMSE
LASSO Regression                   0.793411
Epsilon-Support Vector Regression  0.809163
Huber Regressor                    0.811865
Ridge Regression                   0.813751
Stacking Approach (Blending)       0.817234
Stochastic Gradient Descent        0.817268
H2O GBM                            0.838101
H2O RF                             0.838119
H2O AutoML                         0.840311
H2O DL                             0.850005
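The comparison can also be assembled and sorted programmatically; a sketch using pandas and the RMSE values from the table above:

```python
import pandas as pd

# RMSE scores copied from the comparison table above.
results = pd.DataFrame({
    "model": ["LASSO Regression", "Epsilon-Support Vector Regression",
              "Huber Regressor", "Ridge Regression",
              "Stacking Approach (Blending)", "Stochastic Gradient Descent",
              "H2O GBM", "H2O RF", "H2O AutoML", "H2O DL"],
    "rmse": [0.793411, 0.809163, 0.811865, 0.813751,
             0.817234, 0.817268, 0.838101, 0.838119, 0.840311, 0.850005],
}).sort_values("rmse").reset_index(drop=True)

print(results.iloc[0])  # best model by RMSE
```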

Conclusion

The data contain both categorical and numerical variables. The target variables are DECISION (categorical) and RATING (numerical).

We started this Jupyter notebook with two goals. The first was to predict the decision variable, which determines whether a student is admitted, rejected or waitlisted; recall that we used the rating variable as a predictor there. The second was to predict the rating variable itself. The first problem is a classification problem, whereas the second is a regression problem.

The methods we used include Decision Tree, Logistic Regression, Random Forest, Stochastic Gradient Descent, k-nearest neighbors, Gaussian Naive Bayes, Perceptron, Support Vector Machine, Adaptive Boosting, XGBoost, Stacking (Blending) Ensemble, and h2o functions such as AutoML, GBM, and Deep Learning.

We found that Stochastic Gradient Descent with grid search is the best model for predicting the decision variable. Surprisingly, it even gives a better result than the h2o AutoML function.

To predict the RATING variable, we used quite complicated models, blending many of them in order to minimize the error. However, we found that basic models such as Lasso regression and epsilon-support vector regression gave lower RMSE scores than the more complicated models, including the neural network.

The number of predictor variables is quite large and it is not initially clear which will be the most significant. The results of these analyses point to a common choice of the most relevant predictor variables:

  • Applicant age
  • Applicant GPA
  • Applicant’s choice of emphasis area for study in graduate school
  • Applicant’s GRE scores
  • Applicant’s undergraduate major
  • Applicant's average ratings of knowledge given by applicant's recommenders

Even though one might expect these variables to have high predictive ability, it is not clear which should be most predictive. A surprising output of our analysis is that age is found to be highly predictive, at least for certain models. This was somewhat unexpected, as most applicants tend to be of a similar age in their early twenties. There are some outliers in their early to late thirties, and perhaps the presence of these applicants naturally splits the dataset, making the age variable an easy predictor to split upon to reduce the classification or regression error. Also of some interest is the fact that among the GRE scores the verbal one tends to have slightly more predictive ability than the quantitative one. This is not entirely surprising, as the quantitative scores among Math PhD applicants tend to be fairly homogeneous. There is greater variability within the verbal scores, and apparently there is some sort of positive correlation between verbal abilities and a high rating being given to the student application. Whether this is deliberate or not on the part of the reviewers is unknown.

In both cases, predicting the decision and predicting the rating variable, we see that some simpler models gave better results than the more complicated models. One possible explanation is that the relationship between our variables and the response is close to linear. Another is that the sample size is relatively small.

For future work, one might try a multiple-imputation technique instead of basic methods to fill in the missing values.
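As a sketch of what that could look like, scikit-learn's IterativeImputer models each feature with missing values as a function of the other features, rather than filling with a single constant such as the column mean (the matrix below is a toy example, not the application data):

```python
import numpy as np
# IterativeImputer is still experimental and needs an explicit opt-in import.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data: two correlated features (e.g. GPA and a test score) with gaps.
X = np.array([[3.5, 160.0], [3.9, 165.0], [2.8, np.nan], [np.nan, 150.0]])

# Each feature is imputed from the others in round-robin fashion.
imputed = IterativeImputer(random_state=0).fit_transform(X)
print(imputed)
```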

References